Abstract

We explore training an automatic modality tagger. Modality is the
attitude that a speaker might have toward an event or state. One of
the main hurdles for training a linguistic tagger is gathering
training data. This is particularly problematic for training a tagger
for modality because modality triggers are absent from the
overwhelming majority of sentences. We investigate an approach to
automatically training a modality tagger where we first gathered
sentences based on a high-recall simple rule-based modality tagger
and then provided these sentences to Mechanical Turk annotators for
further annotation. We used the resulting set of training data to
train a precise modality tagger using a multi-class SVM that delivers
good performance.


1 Introduction 


Modality is an extra-propositional component of meaning. In John may
go to NY, the basic proposition is John go to NY and the word may
indicates modality. Van der Auwera and Ammann (2005) define core cases
of modality: John must go to NY (epistemic necessity), John might go
to NY (epistemic possibility), John has to leave now (deontic
necessity) and John may leave now (deontic possibility). Many
semanticists (e.g. Kratzer (1981), Kratzer (1991), Kaufmann et al.
(2006)) define modality as quantification over possible worlds. John
might go means that there exist some possible worlds in which John
goes. Another view of modality relates more to a speaker’s attitude
toward a proposition (e.g. McShane et al. (2004)).

Modality might be construed broadly to include several types of
attitudes that a speaker wants to express towards an event, state or
proposition. Modality might indicate factivity, evidentiality, or
sentiment (McShane et al., 2004). Factivity is related to whether the
speaker wishes to convey his or her belief that the propositional
content is true or not, i.e., whether it actually obtains in this
world or not. It distinguishes things that (the speaker believes)
happened from things that he or she desires, plans, or considers
merely probable. Evidentiality deals with the source of information
and may provide clues to the reliability of the information. Did the
speaker have firsthand knowledge of what he or she is reporting, or
was it hearsay or inferred from indirect evidence? Sentiment deals
with a speaker’s positive or negative feelings toward an event,
state, or proposition.

In this paper, we focus on the following five modalities; we have
investigated the belief/factivity modality previously (Diab et al.,
2009b; Prabhakaran et al., 2010), and we leave other modalities to
future work.

• Ability: can H do P?

• Effort: does H try to do P?

• Intention: does H intend P?

• Success: does H succeed in P?

• Want: does H want P?

We investigate automatically training a modality tagger by using
multi-class Support Vector Machines (SVMs). One of the main hurdles
for training a linguistic tagger is gathering training data. This is
particularly problematic for training a modality tagger because
modality triggers are sparse for the overwhelming majority of the
sentences. Baker et al. (2010) created a modality tagger by using a
semi-automatic approach for creating rules for a rule-based tagger. A
pilot study revealed that it can boost recall well above the
naturally occurring proportion of modality without annotated data but
with only 60% precision. We investigated an approach where we first
gathered sentences based on a simple modality tagger and then
provided these sentences to annotators for further annotation. The
resulting annotated data also preserved the level of inter-annotator
agreement for each example so that learning algorithms could take
that into account during training. Finally, the resulting set of
annotations was used for training a modality tagger using SVMs, which
gave a high precision indicating the success of this approach.

Section 2 discusses related work. Section 3 discusses our procedure
for gathering training data. Section 4 discusses the machine learning
setup and features used to train our modality tagger and presents
experiments and results. Section 5 concludes and discusses future
work.

2 Related Work

Previous related work includes TimeML (Saurí et al., 2006), which
involves modality annotation on events, and Factbank (Saurí and
Pustejovsky, 2009), where event mentions are marked with degree of
factuality. Modality is also important in the detection of
uncertainty and hedging. The CoNLL shared task in 2010 (Farkas et
al., 2010) deals with automatic detection of uncertainty and hedging
in Wikipedia and biomedical sentences.

Baker et al. (2010) and Baker et al. (2012) analyze a set of eight
modalities which include belief, require and permit, in addition to
the five modalities we focus on in this paper. They built a
rule-based modality tagger using a semi-automatic approach to create
rules. This earlier work differs from the work described in this
paper in that our emphasis is on the creation of an automatic
modality tagger using machine learning techniques. Note that the
annotation and automatic tagging of the belief modality (i.e.,
factivity) is described in more detail in (Diab et al., 2009b;
Prabhakaran et al., 2010).


There has been a considerable amount of interest in modality in the
biomedical domain. Negation, uncertainty, and hedging are annotated
in the Bioscope corpus (Vincze et al., 2008), along with information
about which words are in the scope of negation/uncertainty. The i2b2
NLP Shared Task in 2010 included a track for detecting assertion
status (e.g. present, absent, possible, conditional, hypothetical
etc.) of medical problems in clinical records.¹ Apostolova et al.
(2011) present a rule-based system for the detection of negation and
speculation scopes using the Bioscope corpus. Other studies emphasize
the importance of detecting uncertainty in medical text summarization
(Morante and Daelemans, 2009; Aramaki et al., 2009).

Modality has also received some attention in the context of certain
applications. Earlier work describing the difficulty of correctly
translating modality using machine translation includes (Sigurd and
Gawronska, 1994) and (Murata et al., 2005). Sigurd and Gawronska
(1994) write about rule based frameworks and how using alternate
grammatical constructions such as the passive can improve the
rendering of the modal in the target language. Murata et al. (2005)
analyze the translation of Japanese into English by several systems,
showing they often render the present incorrectly as the progressive.
The authors trained a support vector machine to specifically handle
modal constructions, while our modal annotation approach is a part of
a full translation system.

¹https://www.i2b2.org/NLP/Relations/

The textual entailment literature includes modality annotation
schemes. Identifying modalities is important to determine whether a
text entails a hypothesis. Bar-Haim et al. (2007) include polarity
based rules and negation and modality annotation rules. The polarity
rules are based on an independent polarity lexicon (Nairn et al.,
2006). The annotation rules for negation and modality of predicates
are based on identifying modal verbs, as well as conditional
sentences and modal adverbials. The authors read the modality off
parse trees directly using simple structural rules for modifiers.

3 Constructing Modality Training Data 

In this section, we will discuss the procedure we followed to
construct the training data for building the automatic modality
tagger. In a pilot study, we obtained and ran the modality tagger
described in (Baker et al., 2010) on the English side of the
Urdu-English LDC language pack.² We randomly selected 1997 sentences
that the tagger had labeled as not having the Want modality and
posted them on Amazon Mechanical Turk (MTurk). Three different
Turkers (MTurk annotators) marked, for each of the sentences, whether
it contained the Want modality. Using majority rules as the Turker
judgment, 95 (i.e., 4.76%) of these sentences were marked as having a
Want modality. We also posted 1993 sentences that the tagger had
labeled as having a Want modality and only 1238 of them were marked
by the Turkers as having a Want modality. Therefore, the estimated
precision of this type of approach is only around 60%.
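The two rates quoted for the pilot study follow directly from the
reported counts. The short sketch below only redoes that arithmetic;
the variable and function names are ours and are not part of the
original study.

```python
# Counts reported for the pilot study on the Want modality.
flagged_total = 1993      # sentences the rule-based tagger labeled as Want
flagged_confirmed = 1238  # of those, confirmed as Want by the Turker majority
unflagged_total = 1997    # sentences the tagger labeled as not Want
unflagged_want = 95       # of those, marked as Want by the Turker majority


def ratio_percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


precision = ratio_percent(flagged_confirmed, flagged_total)
missed_rate = ratio_percent(unflagged_want, unflagged_total)

print(f"estimated precision: {precision:.1f}%")
print(f"Want among sentences labeled not-Want: {missed_rate:.2f}%")
```

The first figure comes out slightly above 62%, which is the basis for
the "around 60%" estimate.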


Hence, we were not able to use the Baker et al. (2010) tagger to
gather training data. Instead, our approach was to apply a simple
tagger as a first pass, with positive examples subsequently
hand-annotated using MTurk. We made use of sentence data from the
Enron email corpus,³ derived from the version owing to Fiore and
Heer,⁴ further processed as described by (Roark, 2009).⁵

To construct the simple tagger (the first pass), we used a lexicon of
modality trigger words (e.g., try, plan, aim, wish, want) constructed
by Baker et al. (2010). The tagger essentially tags each sentence
that has a word in the lexicon with the corresponding modality. We
wrote a few simple filters for a handful of exceptional cases that
arise due to the fact that our sentences are from e-mail. For
example, we filtered out best wishes expressions, which otherwise
would have been tagged as Want because of the word wishes.
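A minimal sketch of such a first-pass tagger is given below. It is
illustrative only: the tiny lexicon, the assignment of each trigger to
a modality, and the single filter pattern are assumptions made for the
example, not the actual lexicon of Baker et al. (2010) or the filters
we used.

```python
import re

# Hypothetical miniature lexicon: trigger word -> modality.
# The real lexicon is much larger; these mappings are assumed.
LEXICON = {
    "try": "Effort",
    "plan": "Intention",
    "aim": "Intention",
    "wish": "Want",
    "wishes": "Want",
    "want": "Want",
}

# Hypothetical filter for e-mail boilerplate such as "best wishes".
EXCLUDED_PHRASES = [re.compile(r"\bbest wishes\b", re.IGNORECASE)]


def simple_tag(sentence):
    """Return the set of (trigger, modality) pairs found in a sentence."""
    text = sentence
    for pattern in EXCLUDED_PHRASES:
        # Blank out filtered expressions before looking for triggers.
        text = pattern.sub(" ", text)
    tokens = re.findall(r"[a-z]+", text.lower())
    return {(tok, LEXICON[tok]) for tok in tokens if tok in LEXICON}


print(simple_tag("I want to try the new system"))
print(simple_tag("Best wishes, John"))
```

Because any sentence containing a trigger word is tagged, a tagger of
this kind favors recall over precision, which is why its output is
passed on to human annotators.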

The words that trigger modality occur with very different
frequencies. If one is not careful, the training data may be
dominated by only the commonly occurring trigger words and the
learned tagger would then be biased towards these words. In order to
ensure that our training data had a diverse set of examples
containing many lexical triggers and not just a lot of examples with
the same lexical trigger, for each modality we capped the number of
sentences from a single trigger to be at most 50. After we had the
set of sentences selected by the simple tagger, we posted them on
MTurk for annotation.
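The capping step can be sketched as follows. The cap of 50 is taken
from the text; how the retained sentences were chosen when a trigger
exceeded the cap is not specified above, so the random sampling (and
the fixed seed) in this example is an assumption, as are all names.

```python
import random
from collections import defaultdict


def cap_per_trigger(candidates, cap=50, seed=0):
    """Keep at most `cap` sentences per (modality, trigger) pair.

    candidates: iterable of (sentence, trigger, modality) triples.
    Sampling at random is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for sentence, trigger, modality in candidates:
        groups[(modality, trigger)].append(sentence)

    rng = random.Random(seed)
    selected = []
    for (modality, trigger), sentences in sorted(groups.items()):
        if len(sentences) > cap:
            sentences = rng.sample(sentences, cap)
        selected.extend((s, trigger, modality) for s in sentences)
    return selected


# Toy input: one frequent trigger and one rare trigger.
candidates = [(f"s{i}", "want", "Want") for i in range(120)]
candidates += [(f"t{i}", "wish", "Want") for i in range(3)]
kept = cap_per_trigger(candidates)
print(len(kept))
```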

The Turkers were asked to check a box indicating that the modality
was not present in the sentence if the given modality was not
expressed. If they did not check that box, then they were asked to
highlight the target of the modality. Table 1 shows the number of
sentences we posted on MTurk for each modality.⁶ Three Turkers
annotated each sentence. We restricted the task to Turkers who were
adults, had greater than a 95% approval rating, and had completed at
least 50 HITs (Human Intelligence Tasks) on MTurk. We paid US$0.10
for each set of ten sentences.

Since our data was annotated by three Turkers, for training data we
used only those examples for which at least two Turkers agreed on the
modality and the target of the modality. This resulted in 1,008
examples. 674 examples had two Turkers agreeing and 334 had unanimous
agreement. We kept track of the level of agreement for each example
so that our learner could weight the examples differently depending
on the level of inter-annotator agreement.

Modality     Count
Ability        190
Effort        1350
Intention     1320
Success       1160
Want          1390

Table 1: For each modality, the number of sentences returned by the
simple tagger that we posted on MTurk.

²LDC Catalog No.: LDC2006E110.
³http://www-2.cs.cmu.edu/~enron/
⁴http://bailando.sims.berkeley.edu/enron/enron.sql.gz
⁵Data received through personal communication
⁶More detailed statistics on MTurk annotations are available at
http://hltcoe.jhu.edu/datasets/.

4 Multiclass SVM for Modality

In this section, we describe the automatic modality tagger we built
using the MTurk annotations described in Section 3 as the training
data. Section 4.1 describes the training and evaluation data. In
Section 4.2 we present the machinery and Section 4.3 describes the
features we used to train the tagger. In Section 4.4 we present
various experiments and discuss results. Section 4.5 presents
additional experiments using annotator confidence.

4.1 Data

For training, we used the data presented in Section 3. We refer to it
as MTurk data in the rest of this paper. For evaluation, we selected
a part of the LU Corpus (Diab et al., 2009a) (1228 sentences) and our
expert annotated it with modality tags. We first used the high-recall
simple modality tagger described in Section 3 to select the sentences
with modalities. Out of the 235 sentences returned by the simple
modality tagger, our expert removed the ones which did not in fact
have a modality. In the remaining sentences (94 sentences), our
expert annotated the target predicate. We refer to this as the Gold
dataset in this paper. The MTurk and Gold datasets differ in terms of
genres as well as annotators (Turker vs. Expert). The distribution of
modalities in both MTurk and Gold annotations is given in Table 2.

Modality     MTurk   Gold
Ability         6%    48%
Effort         25%    10%
Intention      30%    11%
Success        24%     9%
Want           15%    23%

Table 2: Frequency of Modalities

4.2 Approach

We applied a supervised learning framework using multi-class SVMs to
automatically learn to tag modalities in context. For tagging, we
used the Yamcha (Kudo and Matsumoto, 2003) sequence labeling system
which uses the SVMlight (Joachims, 1999) package for classification.
We used One versus All method for multi-class classification on a
quadratic kernel with a C value of 1. We report recall and precision
on word tokens in our corpus for each modality. We also report Fβ=1
(F)-measure as the harmonic mean between (P)recision and (R)ecall.
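The token-level measures can be sketched as follows. This is a
generic illustration rather than our evaluation script: the tag "O"
for tokens that carry no modality and the use of None for an
undefined score are conventions assumed for the example.

```python
def precision_recall_f1(gold_tags, predicted_tags, modality):
    """Token-level precision, recall and F1 for one modality.

    Returns None for a measure that is undefined, e.g. precision
    when the tagger never assigns the modality.
    """
    pairs = list(zip(gold_tags, predicted_tags))
    true_pos = sum(1 for g, p in pairs if g == modality and p == modality)
    n_predicted = sum(1 for _, p in pairs if p == modality)
    n_gold = sum(1 for g, _ in pairs if g == modality)

    precision = true_pos / n_predicted if n_predicted else None
    recall = true_pos / n_gold if n_gold else None
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        # F with beta = 1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["Want", "O", "Effort", "Want"]
pred = ["Want", "O", "O", "O"]
print(precision_recall_f1(gold, pred, "Want"))
print(precision_recall_f1(gold, pred, "Effort"))
```

In the second call the modality is never predicted, so precision and
F1 are undefined while recall is zero.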

4.3 Features 

We used lexical features at the token level which can be extracted
without any parsing with relatively high accuracy. We use the term
context width to denote the window of tokens whose features are
considered for predicting the tag for a given token. For example, a
context width of 2 means that the feature vector of any given token
includes, in addition to its own features, those of 2 tokens before
and after it as well as the tag prediction for 2 tokens before it. We
did experiments varying the context width from 1 to 5 and found that
a context width of 2 gives the optimal performance. All results
reported in this paper are obtained with a context width of 2. For
each token, we performed experiments using the following lexical
features:

• wordStem - Word stem. 

• wordLemma - Word lemma. 

• POS - Word’s POS tag. 

• isNumeric - Word is Numeric? 

• verbType - Modal/Auxiliary/Regular/Nil 

• whichModal - If the word is a modal verb, 
which modal? 
























We used the Porter stemmer (Porter, 1997) to obtain the stem of a
word token. To determine the word lemma, we used an in-house
lemmatizer using dictionary and morphological analysis to obtain the
dictionary form of a word. We obtained POS tags from the Stanford POS
tagger and used those tags to determine the verbType and whichModal
features. The verbType feature is assigned a value ‘Nil’ if the word
is not a verb and the whichModal feature is assigned a value ‘Nil’ if
the word is not a modal verb. The feature isNumeric is a binary
feature denoting whether the token contains only digits or not.
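The notion of context width can be illustrated with a short sketch.
This is not how Yamcha represents features internally; the feature
naming scheme, the tag "O" and the function names are assumptions
made for the example, and only the windowing logic follows the
description in this section.

```python
def is_numeric(token):
    """True if the token consists only of ASCII digits."""
    return token.isascii() and token.isdigit()


def window_features(token_features, predicted_tags, index, width=2):
    """Collect the features used to tag the token at `index`.

    token_features: one dict of per-token features per token.
    predicted_tags: tags already predicted for tokens before `index`.
    The result holds the token's own features, those of up to `width`
    tokens on either side, and the predicted tags of up to `width`
    preceding tokens.
    """
    features = {}
    for offset in range(-width, width + 1):
        pos = index + offset
        if 0 <= pos < len(token_features):
            for name, value in token_features[pos].items():
                features[f"{name}[{offset:+d}]"] = value
    for back in range(1, width + 1):
        pos = index - back
        if pos >= 0:
            features[f"tag[-{back}]"] = predicted_tags[pos]
    return features


tokens = [{"wordStem": "john"}, {"wordStem": "want"},
          {"wordStem": "to"}, {"wordStem": "leav"}]
feats = window_features(tokens, ["O"], index=1, width=1)
print(feats)
```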


4.4 Experiments and Results 

In this section, we present experiments performed considering all the
MTurk annotations where two annotators agreed and all the MTurk
annotations where all three annotators agreed to be equally correct
annotations. We present experiments applying differential weights for
these annotations in Section 4.5. We performed 4-fold cross
validation (4FCV) on MTurk data in order to select the best feature
set configuration φ. The best feature set obtained was wordStem, POS,
whichModal with a context width of 2. For finding the best performing
feature set and context width configuration, we did an exhaustive
search on the feature space, pruning away features which were proven
not useful by results at earlier stages. Table 3 presents results
obtained for each modality on 4-fold cross validation.

Modality     Precision   Recall   F Measure
Ability           82.4     55.5        65.5
Effort            95.1     82.8        88.5
Intention         84.3     61.3        70.7
Success           93.2     76.6        83.8
Want              88.4     64.3        74.3
Overall           90.1     70.6        79.1

Table 3: Per modality results for best feature set φ on 4-fold cross
validation on MTurk data

We also trained a model on the entire MTurk data using the best
feature set φ and evaluated it against the Gold data. The results
obtained for each modality on gold evaluation are given in Table 4.
We attribute the lower performance on the Gold dataset to its
difference from MTurk data. MTurk data is entirely from email
threads, whereas Gold data contained sentences from newswire, letters
and blogs in addition to emails. Furthermore, the annotation is
different (Turkers vs expert). Finally, the distribution of
modalities in both datasets is very different. For example, the
Ability modality was merely 6% of MTurk data compared to 48% in Gold
data (see Table 2).

Modality     Precision   Recall   F Measure
Ability           78.6     22.0        34.4
Effort            85.7     60.0        70.6
Intention         66.7     16.7        26.7
Success             NA      0.0          NA
Want              92.3     50.0        64.9
Overall           72.1     29.5        41.9

Table 4: Per modality results for best feature set φ evaluated on
Gold dataset

We obtained reasonable performances for Effort and Want modalities
while the performance for other modalities was rather low. Also, the
Gold dataset contained only 8 instances of Success, none of which was
recognized by the tagger resulting in a recall of 0%. Precision (and,
accordingly, F Measure) for Success was considered “not applicable”
(NA), as no such tag was assigned.
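The 4-fold cross validation used in this section can be sketched
generically as below. How the folds were actually formed (for
example, whether they were stratified by modality) is not stated
above, so the simple round-robin assignment here is an assumption.

```python
def k_fold_splits(examples, k=4):
    """Yield (train, test) pairs for k-fold cross validation.

    Examples are dealt to folds in round-robin order; each fold
    serves once as the held-out test set.
    """
    folds = [examples[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [ex for i, fold in enumerate(folds)
                 if i != held_out for ex in fold]
        yield train, test


splits = list(k_fold_splits(list(range(10)), k=4))
for train, test in splits:
    print(len(train), len(test))
```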


4.5 Annotation Confidence Experiments 

Our MTurk data contains sentences for which at least two of the three
Turkers agreed on the modality and the target of the modality. In
this section, we investigate the role of annotation confidence in
training an automatic tagger. The annotation confidence is denoted by
whether an annotation was agreed on by only two annotators or was
unanimous. We denote the set of sentences for which only two
annotators agreed as Agr2 and that for which all three annotators
agreed as Agr3.
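The split into Agr2 and Agr3 can be sketched as follows. The data
layout is an assumption made for the example: each Turker judgment is
a (modality, target) pair, or None when the Turker checked the
"modality not present" box.

```python
from collections import Counter


def majority_judgment(judgments):
    """Return (judgment, votes) for the most frequent judgment."""
    return Counter(judgments).most_common(1)[0]


def split_by_agreement(examples):
    """Partition examples into Agr2 (two agree) and Agr3 (unanimous).

    examples: iterable of (sentence, [judgment, judgment, judgment]).
    Examples with no majority, or whose majority judgment is
    "modality not present" (None), are dropped.
    """
    agr2, agr3 = [], []
    for sentence, judgments in examples:
        judgment, votes = majority_judgment(judgments)
        if votes < 2 or judgment is None:
            continue
        (agr3 if votes == 3 else agr2).append((sentence, judgment))
    return agr2, agr3


examples = [
    ("s1", [("Want", "leave"), ("Want", "leave"), ("Want", "leave")]),
    ("s2", [("Want", "go"), ("Want", "go"), None]),
    ("s3", [("Want", "go"), ("Effort", "go"), None]),
    ("s4", [None, None, ("Want", "stay")]),
]
agr2, agr3 = split_by_agreement(examples)
print(agr2, agr3)
```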

We present four training setups. The first setup is Tr23 where we
train a model using both Agr2 and Agr3 with equal weights. This is
the setup we used for results presented in Section 4.4. Then, we have
Tr2 and Tr3, where we train using only Agr2 and Agr3 respectively.
Then, for Tr23w, we train a model giving different cost values for
Agr2 and Agr3 examples. The SVMlight package allows users to input
cost values c_i for each training instance separately.⁷ We tuned this
cost value for Agr2 and Agr3 examples and found the best value at 20
and 30 respectively.

                  Tested on Agr2 and Agr3          Tested on Agr3 only
Training setup    Precision  Recall  F Measure     Precision  Recall  F Measure
Tr23                   90.1    70.6       79.1          95.9    86.8       91.1
Tr2                    91.0    66.1       76.5          95.6    81.8       88.2
Tr3                    88.1    52.3       65.6          96.8    71.7       82.3
Tr23w                  89.9    70.5       79.0          95.8    86.5       90.9

Table 5: Annotator Confidence Experiment Results; the best results
per column are boldfaced (4-fold cross validation on MTurk Data)
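The per-instance costs used for Tr23w can be illustrated with a small
formatter for SVMlight-style training lines. The placement of the
`cost:<value>` token directly after the label follows the description
given in this paper; the function name, the feature ids and the way
such lines would be passed through Yamcha are assumptions made for
the sketch.

```python
# Cost values reported for the Tr23w setup, keyed by the number of
# agreeing Turkers.
COST_BY_AGREEMENT = {2: 20, 3: 30}


def svmlight_line(label, features, cost=None):
    """Format one binary training instance in SVMlight-style syntax.

    label: +1 or -1 (one binary problem of a one-versus-all setup).
    features: dict mapping integer feature ids (>= 1) to values.
    cost: optional per-instance cost, written as 'cost:<value>'.
    """
    parts = [f"{label:+d}"]
    if cost is not None:
        parts.append(f"cost:{cost:g}")
    parts.extend(f"{fid}:{val:g}" for fid, val in sorted(features.items()))
    return " ".join(parts)


print(svmlight_line(1, {3: 1.0, 1: 0.5}, cost=COST_BY_AGREEMENT[2]))
print(svmlight_line(-1, {2: 1.0}, cost=COST_BY_AGREEMENT[3]))
```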

For all four setups, we used feature set φ. We performed 4-fold cross
validation on MTurk data in two ways: we tested against a combination
of Agr2 and Agr3, and we tested against only Agr3. Results of these
experiments are presented in Table 5. We also present the results of
evaluating a tagger trained on the whole MTurk data for each setup
against the Gold annotation in Table 6. The Tr23 tested on both Agr2
and Agr3 presented in Table 5 and Tr23 tested on Gold data presented
in Table 6 correspond to the results presented in Table 3 and Table 4
respectively.


Training setup    Precision   Recall   F Measure
Tr23                   72.1     29.5        41.9
Tr2                    67.4     27.6        39.2
Tr3                    74.1     19.1        30.3
Tr23w                  73.3     31.4        44.0

Table 6: Annotator Confidence Experiment Results; the best results
per column are boldfaced (Evaluation against Gold)

One main observation is that including annotations of lower
agreement, but still above a threshold (in our case, 66.7%), is
definitely helpful. Tr23 outperformed both Tr2 and Tr3 in both recall
and F-measure in all evaluations. Also, even when evaluating against
only the high-confidence Agr3 cases, Tr2 gave a high gain in recall
(10.1 percentage points) over Tr3, with only a 1.2 percentage point
loss on precision. We conjecture that this is because there are far
more training instances in Tr2 than in Tr3 (674 vs 334), and that
quantity beats quality.

⁷This can be done by specifying ‘cost:<value>’ after the label in
each training instance. This feature has not yet been documented on
the SVMlight website.

Another important observation is the increase in performance by using
varied costs for Agr2 and Agr3 examples (the Tr23w condition).
Although it dropped the performance by 0.1 to 0.2 points in
cross-validation F measure on the Enron corpora, it gained 2.1 points
in Gold evaluation F measure. These results seem to indicate that
differential weighting based on annotator agreement might have more
beneficial impact when training a model that will be applied to a
wide range of genres than when training a model with genre-specific
data for application to data from the same genre. Put differently,
using varied costs prevents genre over-fitting. We don’t have a full
explanation for this difference in behavior yet. We plan to explore
this in future work.

5 Conclusion 

We have presented an innovative way of combining a high-recall simple
tagger with Mechanical Turk annotations to produce training data for
a modality tagger. We show that we obtain good performance on the
same genre as this training corpus (annotated in the same manner),
and reasonable performance across genres (annotated by an independent
expert). We also present experiments utilizing the number of agreeing
Turkers to choose cost values for training examples for the SVM. As
future work, we plan to extend this approach to other modalities
which are not covered in this study.